PERF: GH2003 Series.isin for categorical dtypes #20522

bourbaki · 2018-03-28T18:49:50Z

I have added a branch for the categorical case in the Series.isin function.
I have also added a test for the most crucial cases (NaNs).
closes #20003

TomAugspurger

I'd prefer to move your categorical isin logic to a new method Categorical.isin.

Then in algorithms.py you can do

if is_categorical_dtype(values):
    values = getattr(values, '_values', values)  # extract Categorical from Series/Index
    # your code

Can you add an ASV for this?

TomAugspurger · 2018-03-28T19:28:47Z

pandas/core/series.py

@@ -3507,7 +3507,11 @@ def isin(self, values):
        5    False
        Name: animal, dtype: bool
        """
-        result = algorithms.isin(com._values_from_object(self), values)


I wonder if the _values_from_object can be moved to algorithms.isin? Then this could just be result = algorithms.isin(self, values)

yes let's try to do this, @Ma3aXaKa can you make this change

TomAugspurger · 2018-03-28T19:29:11Z

pandas/tests/series/test_analytics.py

@@ -1255,6 +1255,17 @@ def test_isin_empty(self, empty):
        result = s.isin(empty)
        tm.assert_series_equal(expected, result)

+    def test_isin_cats(self):


This can go in pandas/tests/categorical/test_algos.py.

TomAugspurger · 2018-03-28T19:32:49Z

I'd prefer to move your categorical isin logic to a new method Categorical.isin.

To be a bit clearer about this. I'd imagine the Categorical.isin method doing the codes and categories stuff, to see which codes the items in values would map to. Then calling algorithms.value_counts(codes, code_values), where code_values is the code that each value maps to (if any).

jschendel · 2018-03-28T20:40:29Z

doc/source/whatsnew/v0.23.0.txt

@@ -345,6 +345,7 @@ Other Enhancements
  ``SQLAlchemy`` dialects supporting multivalue inserts include: ``mysql``, ``postgresql``, ``sqlite`` and any dialect with ``supports_multivalues_insert``. (:issue:`14315`, :issue:`8953`)
 - :func:`read_html` now accepts a ``displayed_only`` keyword argument to controls whether or not hidden elements are parsed (``True`` by default) (:issue:`20027`)
 - zip compression is supported via ``compression=zip`` in :func:`DataFrame.to_pickle`, :func:`Series.to_pickle`, :func:`DataFrame.to_csv`, :func:`Series.to_csv`, :func:`DataFrame.to_json`, :func:`Series.to_json`. (:issue:`17778`)
+- Performance enhancement for :func:`Series.isin` in the case of categorical dtypes (:issue:`20003`)


Can you move this to the "Performance Improvements" section? (starts around line 783).

bourbaki · 2018-03-28T22:30:36Z

@TomAugspurger Why do I need to call algorithms.value_counts(codes, code_values) in Categorical.isin? Do you mean algorithms.isin?

TomAugspurger · 2018-03-28T22:44:07Z

Sorry, I meant `isin` :)
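
With that correction, the idea of translating values into category codes and then testing membership on the integer codes can be sketched roughly as follows. This is a minimal illustration using only NumPy, not the code from this PR. The names `isin_via_codes` and `_is_missing` are hypothetical, and the sketch assumes the usual Categorical convention that code -1 marks a missing value.

```python
import numpy as np


def _is_missing(value):
    # Hypothetical helper: treat None and float NaN as missing.
    return value is None or (isinstance(value, float) and value != value)


def isin_via_codes(categories, codes, values):
    """Membership test on integer codes instead of on the labels.

    categories: unique labels; codes: integer positions into categories,
    with -1 standing for a missing value (an assumption of this sketch).
    """
    codes = np.asarray(codes)
    values = list(values)
    # Map each category label to its integer code once.
    lookup = {label: code for code, label in enumerate(categories)}
    # Values that are not categories cannot match any row, so drop them.
    code_values = [lookup[v] for v in values
                   if not _is_missing(v) and v in lookup]
    # A missing value in `values` should match the missing code.
    if any(_is_missing(v) for v in values):
        code_values.append(-1)
    return np.isin(codes, np.array(code_values, dtype=codes.dtype))
```

For example, `isin_via_codes(["a", "b"], [0, 1, -1], ["a", "c"])` compares only small integers per row; the label lookup happens once per value rather than once per element, which is presumably where the speedup for categoricals comes from.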

jreback

does this have an associated issue? this would need asv benchmarks

jreback · 2018-03-30T19:13:10Z

pandas/core/algorithms.py

@@ -403,8 +403,15 @@ def isin(comps, values):
    if not isinstance(values, (ABCIndex, ABCSeries, np.ndarray)):
        values = construct_1d_object_array_from_listlike(list(values))

-    comps, dtype, _ = _ensure_data(comps)
-    values, _, _ = _ensure_data(values, dtype=dtype)
+    if not is_categorical_dtype(comps):


reverse this logic here

Yeah. I am working on asv benchmark

codecov · 2018-03-30T21:47:52Z

Codecov Report

Merging #20522 into master will increase coverage by <.01%.
The diff coverage is 93.33%.

@@            Coverage Diff             @@
##           master   #20522      +/-   ##
==========================================
+ Coverage   91.77%   91.77%   +<.01%     
==========================================
  Files         153      153              
  Lines       49257    49270      +13     
==========================================
+ Hits        45207    45220      +13     
  Misses       4050     4050

Flag	Coverage Δ
#multiple	`90.17% <93.33%> (ø)`	⬆️
#single	`41.89% <93.33%> (+0.01%)`	⬆️

Impacted Files	Coverage Δ
pandas/core/algorithms.py	`94.39% <100%> (+0.02%)`	⬆️
pandas/core/indexes/base.py	`96.63% <100%> (ø)`	⬆️
pandas/core/series.py	`93.99% <100%> (+0.09%)`	⬆️
pandas/core/arrays/categorical.py	`95.7% <90%> (-0.08%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

bourbaki · 2018-03-30T21:52:16Z

@TomAugspurger I have moved the logic to Categorical.isin, but I still need to handle the values argument of that function more carefully. What argument types can be passed to Series.isin, for example? Is using _sanitize_array enough to cover all the basic cases?

Also, I cannot understand why the checks on CircleCI have failed. The same tests pass on my laptop,
and I don't think my changes affect JSON parsing 😄.

TomAugspurger · 2018-04-02T21:45:41Z

Merged in master. That'll hopefully fix the circleCI failure.

TomAugspurger · 2018-04-02T21:46:24Z

@Ma3aXaKa did you run the ASV benchmarks? Or just a simple %timeit before / after? How does the performance look?

TomAugspurger · 2018-04-02T21:49:07Z

What are the possible argument types that can be used in Series.isin for example?

It looks like it can be just about anything. Did you run into issues without using _sanitize_array?

bourbaki · 2018-04-02T21:49:56Z

Yep, I ran the asv benchmark and got a positive result. Do I need to post the results here?

TomAugspurger · 2018-04-02T21:53:08Z

Sure (just the new one). I'm curious to see the performance improvement.

TomAugspurger · 2018-04-02T21:53:41Z

Can you also add a release note (doc/source/whatsnew/v0.23.0.txt) noting the performance improvement? Make sure to pull my commit first.

bourbaki · 2018-04-02T21:56:56Z

I've already added the line about the performance improvement to whatsnew. Do I need to add another one?

TomAugspurger · 2018-04-02T22:14:28Z

Sorry, missed that.

bourbaki · 2018-04-02T22:15:46Z

That's what I got on my laptop from asv:

       before           after         ratio
     [fac2ef1b]       [d6c39534]
-         230±6ms      17.6±0.01ms     0.08  categoricals.IsIn.time_isin_categorical_strings
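
For context, a benchmark of this shape might look like the sketch below. It follows the asv convention of a `setup` method plus `time_*` methods, but the class name `IsInCategoricalSketch`, the array size, the sample size, and the label format are all assumptions made for illustration, not the values used in this PR.

```python
import numpy as np
import pandas as pd


class IsInCategoricalSketch:
    """asv-style benchmark sketch; sizes and labels are assumed, not from the PR."""

    def setup(self):
        n = 10 ** 5          # assumed number of rows
        sample_size = 100    # assumed number of values passed to isin
        # 1000 distinct string labels, repeated at random across n rows.
        labels = np.array(["s%04d" % i for i in range(1000)])
        arr = np.random.choice(labels, n)
        self.sample = np.random.choice(arr, sample_size)
        self.series = pd.Series(arr).astype("category")

    def time_isin_categorical_strings(self):
        # asv times the body of every method whose name starts with "time_".
        self.series.isin(self.sample)
```

Outside asv, the same class can be exercised by calling `setup()` once and then timing `time_isin_categorical_strings()` with `timeit`. A ratio of 0.08 in the table above means the "after" run took about 8% of the "before" time.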

jreback · 2018-04-02T22:23:19Z

pandas/core/series.py

@@ -3564,7 +3564,10 @@ def isin(self, values):
        5    False
        Name: animal, dtype: bool
        """
-        result = algorithms.isin(com._values_from_object(self), values)
+        if is_categorical_dtype(self):
+            result = self._values.isin(values)


I think a more ducklike check would work here, something like

if hasattr(self._values, 'isin'):
    result = self._values.isin(values)
else:
    ......

@jreback What's the purpose? Is it better for performance?

it's more generic. you don't have to know it's a categorical here.
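
A rough sketch of that duck-typed dispatch, using stand-in classes rather than pandas internals (`ArrayWithIsin` and `dispatch_isin` are hypothetical names introduced only for illustration):

```python
import numpy as np


class ArrayWithIsin:
    """Stand-in for an array type that ships its own isin (e.g. a Categorical)."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def isin(self, values):
        return np.isin(self.data, list(values))


def dispatch_isin(array, values):
    # Prefer the array's own isin if it has one; no dtype check is needed.
    if hasattr(array, "isin"):
        return array.isin(values)
    # Generic fallback for anything else that can be turned into an ndarray.
    return np.isin(np.asarray(array), list(values))
```

The caller never names the concrete array type, so any future array class that grows an `isin` method would be picked up without touching the dispatch code.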

bourbaki · 2018-04-03T09:34:14Z

@jreback Changed the condition to hasattr

bourbaki · 2018-04-05T21:40:25Z

I would prefer to convert values to ndarray before this logic. Is there any function I could use to safely do it?

jreback · 2018-04-05T21:50:52Z

use np.asarray as tom suggests

pep8speaks · 2018-04-07T09:31:14Z

Hello @Ma3aXaKa! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on April 25, 2018 at 10:02 Hours UTC

bourbaki · 2018-04-07T22:35:36Z

@TomAugspurger @jreback I have added the docs and fixed the problem with Python 2 (thanks to @TomAugspurger). Is there something else to be done?

jreback · 2018-04-07T22:39:24Z

pandas/tests/categorical/test_algos.py

+
+def test_isin_cats():
+    cat = pd.Categorical(["a", "b", np.nan])
+


can u add the issue number here as a comment

jreback · 2018-04-07T22:40:00Z

asv_bench/benchmarks/categoricals.py

@@ -148,3 +148,18 @@ def time_rank_int_cat(self):

    def time_rank_int_cat_ordered(self):
        self.s_int_cat_ordered.rank()
+
+
+class IsIn(object):


can u make Isin

jreback · 2018-04-07T22:40:30Z

pandas/core/algorithms.py

@@ -407,6 +407,12 @@ def isin(comps, values):
    if not isinstance(values, (ABCIndex, ABCSeries, np.ndarray)):
        values = construct_1d_object_array_from_listlike(list(values))

+    if is_categorical_dtype(comps):


if u change this to is_extension_type does everything still work?

ok, can you add a note: TODO(extension) here

bourbaki · 2018-04-09T09:52:27Z

@jreback I have renamed the benchmark class and added a reference to the issue. Switching to is_extension_type doesn't work: it fails on SparseArray, for example, because that class doesn't have an isin method. I am assuming implementing these methods is the goal of #20617?

jreback · 2018-04-09T15:21:16Z

asv_bench/benchmarks/categoricals.py

+        self.sample = np.random.choice(arr, sample_size)
+        self.ts = pd.Series(arr).astype('category')
+
+    def time_isin_categorical_strings(self):


there are 4 cases in the original issue can you cover them

jreback · 2018-04-09T15:21:37Z

pandas/core/algorithms.py

@@ -407,6 +407,12 @@ def isin(comps, values):
    if not isinstance(values, (ABCIndex, ABCSeries, np.ndarray)):
        values = construct_1d_object_array_from_listlike(list(values))

+    if is_categorical_dtype(comps):


ok, can you add a note: TODO(extension) here

jreback · 2018-04-09T15:22:08Z

pandas/core/arrays/categorical.py

+            raise TypeError("only list-like objects are allowed to be passed"
+                            " to isin(), you passed a [{values_type}]"
+                            .format(values_type=type(values).__name__))
+        from pandas.core.series import _sanitize_array


can you move the import to top of function

jreback · 2018-04-09T15:24:32Z

pandas/tests/categorical/test_algos.py

+    tm.assert_numpy_array_equal(expected, result)
+
+    result = cat.isin(["a", "c"])
+    expected = np.array([True, False, False], dtype=bool)


there are a couple of tests in pandas/tests/test_algos.py that test cats for isin, can you move here

bourbaki · 2018-04-15T23:07:26Z

@jreback I've made the requested changes except for moving the tests to a different file (see comments). But the Travis CI build is failing with an S3 error.

bourbaki · 2018-04-18T13:10:23Z

@jreback There are still errors with S3. Is this OK?

TomAugspurger · 2018-04-18T14:08:42Z

I'm looking into the s3 / moto issues today. Things should be OK on your end.

bourbaki · 2018-04-24T09:53:09Z

@TomAugspurger Hi. Did you fix the issue with S3? Is there something left to do before merging the branch?

jreback · 2018-04-24T10:33:14Z

@Ma3aXaKa can you rebase

jreback · 2018-04-25T10:02:11Z

rebased, so let's merge on green.

jreback · 2018-04-25T12:38:43Z

thanks @Ma3aXaKa what looked like an easy issue morphed into code refactoring! thanks for the patch and patience! keep em coming!

PERF: GH2003 Series.isin for categorical dtypes

19ac11a

TomAugspurger reviewed Mar 28, 2018

jschendel reviewed Mar 28, 2018

jreback requested changes Mar 30, 2018

jreback added the Performance and Categorical labels Mar 30, 2018

bourbaki added 4 commits March 30, 2018 23:38

Add Categorical.isin method

54021b9

Add benchmark

2514b45

Rename benchmark

80f687a

change what's new

d6c3953

Merge remote-tracking branch 'upstream/master' into ma3axaka-isin-for…

33e3b07

…-cats

jreback requested changes Apr 2, 2018

rf: more generic check

ceffccd

bourbaki added 2 commits April 6, 2018 01:23

Fix for null mask

2b7b1c4

Add docs and raise error on non-list-like

4478a49

fix doc line

64fef49

jreback requested changes Apr 7, 2018

refactor benchmark name and add reference to issue

b25da12

jreback requested changes Apr 9, 2018

bourbaki added 6 commits April 15, 2018 20:21

add todo

9f8e790

move import from the function to the top of the file

60ac658

add int64 benchmark

50aca26

move import to the top of the function

713712e

add int64 categorical test

18c827d

Merge branch 'master' into isin-for-cats

993afd8

rename variable in benchmark

a2b70ee

jreback added this to the 0.23.0 milestone Apr 24, 2018

jreback added 2 commits April 25, 2018 05:59

Merge branch 'master' into PR_TOOL_MERGE_PR_20522

fa7f0f1

whitespace

7b680cd

jreback approved these changes Apr 25, 2018

jreback merged commit 60fe82c into pandas-dev:master Apr 25, 2018

